ASA DataFest 2024 Workshop

Module 01: Introduction to R and Data Wrangling

Iris Jiang & Thomas Fung

School of Mathematical and Physical Sciences

Acknowledgement

Prerequisites

  • This is an introductory R workshop:
    • Assumes no prior knowledge of how to use R
    • We do assume you know why you want to learn R
    • Relatively slow-paced
      • Ask questions at any time
      • Collaboration is encouraged

Artwork by @allison_horst

Artwork by @allison_horst

Introduction to R and RStudio

Introduction to R

  • What is R?
    • R is a free language and environment for statistical computing and graphics.
    • R is modular — most functionality is from add-on packages. So the language can be thought of as a platform for creating and running a large number of useful packages.

Setup Step 0: Head to Posit Website

Setup Step 1: Install Base R

  • Use the button/link to download and install R.
    • (preferably latest R version 4.3.3 (2024-02-29))
    • If you have a Mac, install the latest release from the newest R-x.x.x.pkg link (for Intel-based, i.e. older, Macs) or the R-x.x.x-arm64.pkg link (for Apple silicon-based, i.e. newer, Macs, M1-3), or a legacy version if you have an older operating system. After installing R, you should also install XQuartz http://xquartz.macosforge.org to be able to use some visualisation packages.
    • If you are installing the Windows version, choose the “base” subdirectory and click on the download link at the top of the page. After you install R, you should also install RTools https://cran.rstudio.com/bin/windows/Rtools/; download the RTools installer file for your version of R.
    • If you are using Linux, choose your specific operating system and follow the installation instructions.

Setup Step 2: Install RStudio

  • Click on the second button to download and then install RStudio Desktop (Open Source License) version for your operating system.
  • Or you can scroll down to the All Installers and Tarballs section to find the appropriate installer for your operating system.

How does R work?

  • When R is running, variables, data, functions, results, etc., are stored in memory on the computer in the form of objects that have a name.
  • The user can perform actions on these objects with operators (arithmetic, logical, comparison, etc.) and functions (which are themselves objects).
  • Because R stores results in an object (a data structure), an analysis can be done with no result displayed. Such a feature is very useful, since a user can extract only that part of the results that is of interest and can pass results into further analyses.
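As a small illustration of storing results in an object, the sketch below fits a linear model without printing anything and then pulls out only the pieces of interest. The built-in mtcars data and the model mpg ~ wt are chosen purely for illustration.

```r
# Fit a model and store the whole result in an object; nothing is printed.
fit <- lm(mpg ~ wt, data = mtcars)

# Extract only the part of the result that is of interest.
slope <- coef(fit)[["wt"]]

# Pass the stored result into a further analysis.
predicted <- predict(fit, newdata = data.frame(wt = 3))
```

Typing fit, slope, or predicted at the Console displays the corresponding object.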

Artwork by @allison_horst

Using R via RStudio

  • There are different ways of interacting with R.
  • R is like a car’s engine while RStudio is like a car’s dashboard. Much as we don’t drive a car by interacting directly with the engine but rather by interacting with elements on the car’s dashboard, we won’t be using R directly but rather we will use RStudio’s interface.
  • RStudio is a free, open source, and R-specific integrated development environment (IDE).
  • It provides a built in editor, works on all platforms (including on servers) and provides many advantages such as integration with version control and project management.

Basic Layout

Basic Layout (Cont.)

  • “Four quadrants” of RStudio:
    • By default, the upper left panel is the source panel, where you view and edit source code from files.
    • The bottom left panel is usually the interactive R console, where you can type in commands and view output messages.
    • The right panels have several different tabs that show you information about your code.
    • Layout can be changed via Tools > Global Options > Pane Layout.

More about the Interactive R Console

  • The interactive R console is where you will run all of your code, and can be a useful environment to try out ideas before adding them to an R script file.
  • This console in RStudio is the same as the one you would get if you opened the R application or you just typed in “R” in your command line environment.
  • This console operates on the idea of a “read, evaluate, and print loop”:
    • You type in your commands and hit enter;
    • R reads and interprets what you’ve written and tries to execute it;
    • R prints a result to the console.

Launch a Session – RStudio project

  • In RStudio go to File > New Project.
  • Choose Existing Directory and browse to the workshop materials directory on your desktop.
  • This will create an .Rproj file for your project and will automatically change your working directory to the workshop materials directory.

Artwork by @allison_horst

R and its packages

  • Many useful R functions come in packages, free libraries of code written by R’s active user community.
  • Traditionally, to install an R package, open an R session and type at the Console:
install.packages("<name of the package>")
  • R will download the package from CRAN, so you’ll need to be connected to the internet.
  • Once you have a package installed, you can make its contents available to use in your current R session by running
library(<name of the package>) # or library("<name of the package>")
  • For example, to install and use the pak package, you would run:
install.packages("pak")
library(pak)

Why pak?

  • pak provides a fresh approach to R package installation.
  • pak installs R packages from CRAN, Bioconductor, GitHub, URLs, git repositories, local files and directories with a single function.
  • pak is:
    • ⚡ Fast - parallel downloads and installation, caching, etc.
    • 🦺 Safe - dependency solver, system dependency solver, etc.
    • 🏪 Convenient - packages from multiple sources, etc.
  • This means you should use
pak::pak("tidyverse")

instead of

install.packages("tidyverse")

unless pak::pak() gives you an error.

R Objects – Vectors

Naming rules

  • Variable names should start with a letter (A-Z and a-z) and can include
    • letters
    • digits (0-9)
    • dots (.)
    • underscores (_)
  • Different conventions for long variable names include
    • underscores_between_words
    • periods.between.words
    • camelCaseToSeparateWords

Tip

What you use is up to you, but be consistent, and remember that you’re likely going to be typing these out quite a few times, so try to be concise too.

Atomic Vectors

An atomic vector is just a simple vector of data. You can make an atomic vector by combining multiple elements with c():

die <- c(1, 2, 3, 4, 5, 6)
die
[1] 1 2 3 4 5 6
is.vector(die)
[1] TRUE
typeof(die)
[1] "double"
length(die)
[1] 6

Note

  • You can create new objects with the assignment operator <-.
  • object_name <- value
  • It can be read as “object name gets value”.
  • The name on the left gets the object on the right.

Important types – Logical

  • Logical vectors are the simplest type of atomic vector because they can take only three possible values: FALSE, TRUE, and NA. Logical vectors are usually constructed with comparison operators.

  • Comparison operators include: >, >=, <, <=, != (not equal), and == (equal).

a <- c(1:10)
a
 [1]  1  2  3  4  5  6  7  8  9 10
a %% 2 == 0
 [1] FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE

Warning

  • When you’re starting out with R, one of the easiest mistakes to make is to use = instead of == when testing for equality.
  • = is an assignment operator (but <- is the preferred assignment operator).
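A minimal sketch of the difference between assignment and comparison (the variable names are arbitrary):

```r
x <- 5               # assignment with the preferred operator
y = 5                # also assignment, but easy to confuse with a comparison
is_equal <- x == y   # comparison: returns a logical value
is_equal
```

Writing x = y where x == y was intended would silently overwrite x instead of testing it.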

Important types – Numeric

  • Integer and double vectors are known collectively as numeric vectors.
  • In R, numbers are doubles by default.
  • To make an integer, place an L after the number.
a <- runif(10)
a
 [1] 0.7437362 0.1606890 0.2471166 0.2325289 0.8924933 0.5339667 0.5963321
 [8] 0.2648153 0.4977776 0.8967283
typeof(a)
[1] "double"
b <- c(1L:10L)
b
 [1]  1  2  3  4  5  6  7  8  9 10
typeof(b)
[1] "integer"

Important types – Character

  • A character vector stores small pieces of text.
  • You can create a character vector in R by typing a character or string of characters surrounded by quotes.
  • The individual elements of a character vector are known as strings. Note that a string can contain more than just letters.
text <- c("Hello", "World")
text
[1] "Hello" "World"
typeof(text)
[1] "character"
num_text <- c("1", "2", "three")
num_text
[1] "1"     "2"     "three"
typeof(num_text)
[1] "character"

Note

Logicals, numerics, and characters are the most common types of atomic vectors in R, but R also recognises two more types: complex and raw.
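For completeness, a short sketch of the two less common types (the values are arbitrary):

```r
z <- c(1 + 2i, 3 - 1i)    # complex numbers use the suffix i
typeof(z)

r <- as.raw(c(1, 255))    # raw vectors hold bytes
typeof(r)
```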

From base R to tidyverse

  • Installation
# install.packages("pak")
# uncomment the line above and run to install
pak::pak("tidyverse")
  • Usage
library(tidyverse)

  • You can check that all tidyverse packages are up-to-date with tidyverse_update().

Core tidyverse

  • The tidyverse is a set of packages that work in harmony because they share an underlying design philosophy, grammar, and data representations.

  • The core tidyverse includes the packages that you’re likely to use in everyday data analyses.

  • As of tidyverse 2.0.0, the following packages are included in the core tidyverse.

Core tidyverse (Cont.)

  • library(tidyverse) will load the core tidyverse packages:
    • ggplot2 for data visualisation
    • dplyr for data manipulation
    • tidyr for data tidying
    • readr for data import
    • purrr for functional programming
    • tibble for tibbles, a modern re-imagining of data frames
    • stringr for strings
    • forcats for factors
    • lubridate for dates and times

Core tidyverse to Phases of the Data Science Cycle

Tidy Data and Data Wrangling

Tidy Data

  • Tidy data is used wherever possible throughout R (and within the tidyverse of Wickham, 2022).

Artwork by @allison_horst

Why tidy data?

  • If you ensure that your data is tidy, you’ll spend less time fighting with the tools and more time working on your analysis.

Artwork by @allison_horst

tibble VS data.frame

  • A tibble is often considered a neater format of a data frame, and it is often used in the tidyverse packages.

  • It contains the same information as a data frame, but the manipulation and representation of tibbles is different from data frames in some aspects.

  • Compared to data.frame(), tibble() does much less (and probably complains much more!)

    • It never changes the type of the inputs
    • It never changes the names of variables
    • It never creates row names
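One of these differences can be seen with a column name that contains a space. The sketch below assumes the tibble package is installed; the column name my var is made up for illustration.

```r
library(tibble)

# data.frame() rewrites a non-syntactic column name by default ...
df_base <- data.frame(`my var` = 1:2)
names(df_base)

# ... while tibble() keeps the name exactly as given.
df_tbl <- tibble(`my var` = 1:2)
names(df_tbl)
```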

Artwork by @allison_horst

tibble VS data.frame (Cont.)

  • You can create a tibble from an existing data frame by using as_tibble()
library(tidyverse) # or
library(tibble)

df <- iris
class(df)
[1] "data.frame"
df_tib <- as_tibble(df)
df_tib
# A tibble: 150 × 5
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
          <dbl>       <dbl>        <dbl>       <dbl> <fct>  
 1          5.1         3.5          1.4         0.2 setosa 
 2          4.9         3            1.4         0.2 setosa 
 3          4.7         3.2          1.3         0.2 setosa 
 4          4.6         3.1          1.5         0.2 setosa 
 5          5           3.6          1.4         0.2 setosa 
 6          5.4         3.9          1.7         0.4 setosa 
 7          4.6         3.4          1.4         0.3 setosa 
 8          5           3.4          1.5         0.2 setosa 
 9          4.4         2.9          1.4         0.2 setosa 
10          4.9         3.1          1.5         0.1 setosa 
# ℹ 140 more rows

Data Wrangling

  • We often use data wrangling interchangeably with data manipulation.

Artwork by @allison_horst

Subsetting rows with a logical vector – Base R


df_tib[df_tib$Species == "setosa", ]
# A tibble: 50 × 5
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
          <dbl>       <dbl>        <dbl>       <dbl> <fct>  
 1          5.1         3.5          1.4         0.2 setosa 
 2          4.9         3            1.4         0.2 setosa 
 3          4.7         3.2          1.3         0.2 setosa 
 4          4.6         3.1          1.5         0.2 setosa 
 5          5           3.6          1.4         0.2 setosa 
 6          5.4         3.9          1.7         0.4 setosa 
 7          4.6         3.4          1.4         0.3 setosa 
 8          5           3.4          1.5         0.2 setosa 
 9          4.4         2.9          1.4         0.2 setosa 
10          4.9         3.1          1.5         0.1 setosa 
# ℹ 40 more rows

Note

  • There are quite a few different ways that you can use [ with a data frame, but the most important way is to select rows and columns independently with df[rows, cols].
  • df[rows, ] and df[ ,cols] select just rows or just columns, using the empty subset to preserve the other dimension.
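A short sketch of the three forms, using the built-in iris data (the object names are arbitrary):

```r
df <- iris

# Rows and columns together: rows 1 to 3, two columns chosen by name.
corner <- df[1:3, c("Sepal.Length", "Species")]

# Rows only: the empty column slot keeps every column.
first_rows <- df[1:3, ]

# Columns only: the empty row slot keeps every row.
two_cols <- df[, c("Sepal.Length", "Species")]
```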

Filter rows with dplyr::filter()

Artwork by @allison_horst

Filter rows with dplyr::filter() (Cont.)


df_tib %>%
  filter(Species == "setosa")

Note

  • filter() is equivalent to subsetting the rows with a logical vector, taking care to exclude missing values.
  • Almost all tidyverse packages import the magrittr package to use %>%.
  • x %>% f(y) turns into f(x, y), so the result from one step is “piped” into the next step. You can use the pipe to rewrite multiple operations so that they read left-to-right, top-to-bottom.
  • When you see the pipe operator %>%, read it as “and then”.
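To make the rewriting rule concrete, the sketch below writes the same two steps as a nested call and as a pipeline, assuming dplyr is installed. The object names are arbitrary.

```r
library(dplyr)  # loading dplyr makes %>% available

# Nested call: has to be read from the inside out.
nested <- head(filter(iris, Species == "setosa"), 3)

# Piped call: take iris, and then filter, and then take the head.
piped <- iris %>%
  filter(Species == "setosa") %>%
  head(3)

identical(nested, piped)
```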

Subsetting rows by index with dplyr::slice()


df_tib %>%
  slice(1:3)
# A tibble: 3 × 5
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
         <dbl>       <dbl>        <dbl>       <dbl> <fct>  
1          5.1         3.5          1.4         0.2 setosa 
2          4.9         3            1.4         0.2 setosa 
3          4.7         3.2          1.3         0.2 setosa 

Tip

  • slice_head() and slice_tail() select the first or last rows
  • slice_sample() randomly selects rows
  • slice_min() and slice_max() select rows with the lowest or highest values of a variable
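A brief sketch of these helpers on the built-in iris data, assuming dplyr is installed. Note that slice_min() and slice_max() keep ties by default, so they can return more than n rows.

```r
library(dplyr)

first_two <- iris %>% slice_head(n = 2)   # first 2 rows
last_two  <- iris %>% slice_tail(n = 2)   # last 2 rows

# Rows with the 3 largest Sepal.Length values (ties are kept).
longest <- iris %>% slice_max(Sepal.Length, n = 3)

# Rows with the 3 smallest Sepal.Length values (ties are kept).
shortest <- iris %>% slice_min(Sepal.Length, n = 3)
```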

Subsetting columns with a character vector

  • Base R:
df_tib[df_tib$Species == "setosa", c("Sepal.Length", "Sepal.Width")]
# A tibble: 50 × 2
   Sepal.Length Sepal.Width
          <dbl>       <dbl>
 1          5.1         3.5
 2          4.9         3  
 3          4.7         3.2
 4          4.6         3.1
 5          5           3.6
 6          5.4         3.9
 7          4.6         3.4
 8          5           3.4
 9          4.4         2.9
10          4.9         3.1
# ℹ 40 more rows
  • dplyr:
df_tib %>%
  filter(Species == "setosa") %>%
  select(Sepal.Length, Sepal.Width)
  • Base R also provides a function that combines the features of filter() and select() called subset()
df_tib %>%
  subset(Species == "setosa", c(Sepal.Length, Sepal.Width))

dplyr: A Grammar of Data Manipulation

  • dplyr aims to provide a function for each basic verb of data manipulation.
  • These verbs can be organised into three categories based on the component of the dataset that they work with:
    • Rows:
      • filter() chooses rows based on column values.
      • slice() chooses rows based on location.
      • arrange() changes the order of the rows.
    • Columns:
      • select() changes whether or not a column is included.
      • rename() changes the name of columns.
      • mutate() changes the values of columns and creates new columns.
      • relocate() changes the order of the columns.
    • Groups of rows:
      • summarise() collapses a group into a single row.
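The verbs above can be chained in one pipeline. The sketch below uses mtcars and assumes dplyr is installed; the new column names (horsepower, wt_kg, mean_mpg) are made up for illustration.

```r
library(dplyr)

result <- mtcars %>%
  filter(cyl == 4) %>%                  # rows: keep 4-cylinder cars
  arrange(desc(mpg)) %>%                # rows: most efficient first
  select(mpg, wt, hp) %>%               # columns: keep three of them
  rename(horsepower = hp) %>%           # columns: rename one
  mutate(wt_kg = wt * 1000 * 0.4536) %>% # columns: add one (wt is in 1000 lbs)
  relocate(wt_kg, .after = wt)          # columns: reorder

summary_row <- result %>%
  summarise(mean_mpg = mean(mpg))       # rows collapsed into a single row
```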

Your mission, should you choose to accept it!

  • We are going to use the mtcars dataset.
  • Take a glimpse of it before answering the question below:
mtcars %>%
  glimpse() # tidyverse version of str()
  • What do you think the following will do?
rename(mtcars, miles_per_gallon = mpg)
arrange(mtcars, wt)

Tip

Get some help by typing ?dplyr::arrange() in the Console

Adding or modifying a column: dplyr::mutate()

Artwork by @allison_horst

Adding or modifying a column: dplyr::mutate() (Cont.)


df_tib %>%
  mutate(sepal_ratio = Sepal.Length / Sepal.Width) %>%
  head()
# A tibble: 6 × 6
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species sepal_ratio
         <dbl>       <dbl>        <dbl>       <dbl> <fct>         <dbl>
1          5.1         3.5          1.4         0.2 setosa         1.46
2          4.9         3            1.4         0.2 setosa         1.63
3          4.7         3.2          1.3         0.2 setosa         1.47
4          4.6         3.1          1.5         0.2 setosa         1.48
5          5           3.6          1.4         0.2 setosa         1.39
6          5.4         3.9          1.7         0.4 setosa         1.38

Tip

case_when() and if_else() are useful for conditional mutation.
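A minimal sketch of both helpers inside mutate(), assuming dplyr is installed. The cut-off values and labels are arbitrary choices for illustration.

```r
library(dplyr)

sized <- iris %>%
  mutate(
    # if_else(): a two-way choice
    long_sepal = if_else(Sepal.Length > 5.8, "long", "short"),
    # case_when(): several conditions, checked in order
    petal_size = case_when(
      Petal.Length < 2 ~ "small",
      Petal.Length < 5 ~ "medium",
      TRUE             ~ "large"
    )
  )
```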

tidyselect: Selection language

  • A backend for the selecting functions of the tidyverse. It provides helpers for selecting variables.

    • : for selecting contiguous variables.
    • ! for taking complement set of variables.
    • & or | for selecting the intersection or union of two sets of variables.
    • starts_with() selects columns with the given prefix.
    • ends_with() selects columns with the given suffix.
    • everything() to select all variables.
    • last_col() to select last variable, with option of an offset.
    • contains() selects columns with a literal string.
    • all_of() for selecting columns based on a character vector.
    • and many more!
    help(language, package = "tidyselect")
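A few of these helpers in action with select() on the built-in iris data, assuming dplyr is installed (the object names are arbitrary):

```r
library(dplyr)

sepal_cols  <- iris %>% select(starts_with("Sepal"))       # by prefix
width_cols  <- iris %>% select(ends_with("Width"))         # by suffix
not_species <- iris %>% select(!Species)                   # complement
a_range     <- iris %>% select(Sepal.Length:Petal.Length)  # contiguous columns

# Intersection of two selections.
sepal_width <- iris %>% select(starts_with("Sepal") & ends_with("Width"))

# Columns named in a character vector.
wanted  <- c("Species", "Petal.Width")
by_name <- iris %>% select(all_of(wanted))
```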

Your mission, should you choose to accept it!

  • In this question, we’ll use a dataset named babynames, which comes in a package that is also named babynames.
  • To access the data, we use
# pak::pak("babynames") # if you have never used this package before
library(babynames)
  • Within babynames, you will find information about almost every name given to children in the United States since 1880.
babynames
# A tibble: 1,924,665 × 5
    year sex   name          n   prop
   <dbl> <chr> <chr>     <int>  <dbl>
 1  1880 F     Mary       7065 0.0724
 2  1880 F     Anna       2604 0.0267
 3  1880 F     Emma       2003 0.0205
 4  1880 F     Elizabeth  1939 0.0199
 5  1880 F     Minnie     1746 0.0179
 6  1880 F     Margaret   1578 0.0162
 7  1880 F     Ida        1472 0.0151
 8  1880 F     Alice      1414 0.0145
 9  1880 F     Bertha     1320 0.0135
10  1880 F     Sarah      1288 0.0132
# ℹ 1,924,655 more rows

Your mission, should you choose to accept it! (Cont.)

  • What do you think this line of code will do?
select(babynames, name, sex)
  • What’s wrong with the following 3 lines of code? Can you fix them?
filter(babynames, name = "Sea")
filter(babynames, name == Sea)
filter(babynames, 10 < n < 20)

Your mission, should you choose to accept it! (cont.)

  • Can you use filter() to extract
    • all of the names where prop is greater than or equal to 0.08;
    • all of the children named Khaleesi;
    • all the girls named Sea;
    • Names that were used by exactly 5 or 6 children in 1880;
    • Names that are one of Acura, Lexus, or Yugo? Hint: ? '%in%'

Join Data Sets

Motivating Example: nycflights13

  • Flight delays are an unfortunate aspect of air travel.
  • You may wonder: is it possible to predict which flights will be delayed?
  • The flights dataset in the nycflights13 package provides some relevant information.
# pak::pak("nycflights13") # only if you have never used this package before
library(nycflights13)
flights
# A tibble: 336,776 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      517            515         2      830            819
 2  2013     1     1      533            529         4      850            830
 3  2013     1     1      542            540         2      923            850
 4  2013     1     1      544            545        -1     1004           1022
 5  2013     1     1      554            600        -6      812            837
 6  2013     1     1      554            558        -4      740            728
 7  2013     1     1      555            600        -5      913            854
 8  2013     1     1      557            600        -3      709            723
 9  2013     1     1      557            600        -3      838            846
10  2013     1     1      558            600        -2      753            745
# ℹ 336,766 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>
  • It contains details of every flight that departed from an airport that serves New York City in 2013.

Motivating Example: nycflights13 (cont.)

  • Let’s use it to explore which airlines have the largest flight delays by comparing the average (arrival) delay time by airline:
flights %>%
  drop_na(arr_delay) %>%
  group_by(carrier) %>%
  summarise(avg_delay = mean(arr_delay)) %>%
  arrange(desc(avg_delay))
# A tibble: 16 × 2
   carrier avg_delay
   <chr>       <dbl>
 1 F9         21.9  
 2 FL         20.1  
 3 EV         15.8  
 4 YV         15.6  
 5 OO         11.9  
 6 MQ         10.8  
 7 WN          9.65 
 8 B6          9.46 
 9 9E          7.38 
10 UA          3.56 
11 US          2.13 
12 VX          1.76 
13 DL          1.64 
14 AA          0.364
15 HA         -6.92 
16 AS         -9.93 
  • This shows F9 had the worst record for delays in the New York City area in 2013.
  • But which airline has the carrier code F9?

Motivating Example: nycflights13 (cont.)

  • Alternatively, use per-operation grouping (the .by argument of summarise()) to calculate grouped summary statistics:
flights %>%
  drop_na(arr_delay) %>%
  summarise(
    avg_delay = mean(arr_delay),
    .by = carrier
  ) %>%
  arrange(desc(avg_delay))
# A tibble: 16 × 2
   carrier avg_delay
   <chr>       <dbl>
 1 F9         21.9  
 2 FL         20.1  
 3 EV         15.8  
 4 YV         15.6  
 5 OO         11.9  
 6 MQ         10.8  
 7 WN          9.65 
 8 B6          9.46 
 9 9E          7.38 
10 UA          3.56 
11 US          2.13 
12 VX          1.76 
13 DL          1.64 
14 AA          0.364
15 HA         -6.92 
16 AS         -9.93 

Motivating Example: nycflights13 (cont.)

  • Luckily, the nycflights13 package comes with another data set, airlines, which matches the name of each airline to its carrier code.
airlines
# A tibble: 16 × 2
   carrier name                       
   <chr>   <chr>                      
 1 9E      Endeavor Air Inc.          
 2 AA      American Airlines Inc.     
 3 AS      Alaska Airlines Inc.       
 4 B6      JetBlue Airways            
 5 DL      Delta Air Lines Inc.       
 6 EV      ExpressJet Airlines Inc.   
 7 F9      Frontier Airlines Inc.     
 8 FL      AirTran Airways Corporation
 9 HA      Hawaiian Airlines Inc.     
10 MQ      Envoy Air                  
11 OO      SkyWest Airlines Inc.      
12 UA      United Air Lines Inc.      
13 US      US Airways Inc.            
14 VX      Virgin America             
15 WN      Southwest Airlines Co.     
16 YV      Mesa Airlines Inc.         
  • While you could potentially look up F9 manually every time, you probably don’t want to do it that way.
  • A better solution would be to join the airlines data set to your results programmatically.
  • This is easy to do with one of dplyr’s four join functions: left_join(), right_join(), full_join(), and inner_join().

Toy data

  • The easiest way to learn how join functions work is visually.
  • Here are some small toy datasets that we can visualize in their entirety: band_members and band_instruments, which look like this (the datasets are named band & instruments respectively in the images below):

Note

  • Notice that each data set has a column named name.
  • If you know a little about The Beatles, you should recognise them.
  • The rows named Mick and Keith do not match any rows in the other data set.
  • Finally, notice that the matching rows do not appear in the same place in each data set.
    • For example, John is in the second row of band, but the first row of instruments.

Toy data (cont.)

To see the raw data, you can run the following code (both datasets are part of dplyr):

band_members
# A tibble: 3 × 2
  name  band   
  <chr> <chr>  
1 Mick  Stones 
2 John  Beatles
3 Paul  Beatles
band_instruments
# A tibble: 3 × 2
  name  plays 
  <chr> <chr> 
1 John  guitar
2 Paul  bass  
3 Keith guitar
  • These small data sets do a good job of matching the haphazard nature of real data.
  • Our task will be to join them into a single data set that correctly matches the John and Paul rows to each other.

Fantastic Four Vol 1: left_join()

  • The left_join() function returns a copy of a data set that is augmented with information from a second data set.
    • It retains all of the rows of the first data set, and only adds rows from the second data set that match rows in the first.

Important

  • Mick is retained in the result (with an NA in the appropriate spot) because Mick appears in the first data set.
  • Keith does not appear in the result because Keith does not appear in the first data set.

Fantastic Four Vol 1: left_join() (cont.)

  • To obtain the same results in R, we do:
band_members %>%
  left_join(band_instruments, by = "name")
# A tibble: 3 × 3
  name  band    plays 
  <chr> <chr>   <chr> 
1 Mick  Stones  <NA>  
2 John  Beatles guitar
3 Paul  Beatles bass  
  • The by argument specifies the column that the two datasets have in common.

  • What if the column names are different in the two datasets?

band_members %>%
  left_join(band_instruments, by = c("name" = "name"))

Fantastic Four Vol 1: left_join() (cont.)

  • New join_by() function for the by argument
band_members %>%
  left_join(band_instruments, by = join_by(name == name))
  • The argument name by = can be omitted for succinctness
band_members %>%
  left_join(band_instruments, join_by(name == name))
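The examples above use the same key name on both sides. For a case where the names really differ, dplyr also ships band_instruments2, in which the key column is called artist instead of name. The sketch below joins on it in both styles; join_by() needs dplyr 1.1.0 or later.

```r
library(dplyr)

# Named character vector: left-hand column name = right-hand column name.
joined_chr <- band_members %>%
  left_join(band_instruments2, by = c("name" = "artist"))

# join_by(): left-hand column == right-hand column.
joined_jb <- band_members %>%
  left_join(band_instruments2, by = join_by(name == artist))

identical(joined_chr, joined_jb)
```

In both cases the result keeps the key column under the name used in the first data set.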

Fantastic Four Vol 2: right_join()

  • right_join() does the opposite of left_join();

    • it retains every row from the second data set and only adds rows from the first data set that have a match in the second data set.

Important

  • Now Keith appears in the result because Keith appears in the second data set.
  • Mick does not appear in the result because he only appears in the first data set.

Fantastic Four Vol 2: right_join() (cont.)

  • To obtain the same results in R, we do:
band_members %>%
  right_join(band_instruments, by = "name")
# A tibble: 3 × 3
  name  band    plays 
  <chr> <chr>   <chr> 
1 John  Beatles guitar
2 Paul  Beatles bass  
3 Keith <NA>    guitar

Fantastic Four Vol 3: full_join()

  • A full_join() retains every row from both data sets, inserting NA placeholders throughout the results as necessary.
  • This is the only join that does not lose any information from the original data sets.

Important

  • Both Mick and Keith appear in the results.

Fantastic Four Vol 3: full_join() (cont.)

  • To obtain the same results in R, we do:
band_members %>%
  full_join(band_instruments, by = "name")
# A tibble: 4 × 3
  name  band    plays 
  <chr> <chr>   <chr> 
1 Mick  Stones  <NA>  
2 John  Beatles guitar
3 Paul  Beatles bass  
4 Keith <NA>    guitar

Fantastic Four Vol 4: inner_join()

  • An inner_join() only retains rows that appear in both datasets.

Important

  • John and Paul appear in the result and Mick and Keith are left behind.

Fantastic Four Vol 4: inner_join() (cont.)

  • To obtain the same results in R, we do:
band_members %>%
  inner_join(band_instruments, by = "name")
# A tibble: 2 × 3
  name  band    plays 
  <chr> <chr>   <chr> 
1 John  Beatles guitar
2 Paul  Beatles bass  

Your mission, should you choose to accept it!

  • Let’s finish what we started and complete our airlines query. Add two more lines to the code below:
flights %>%
  drop_na(arr_delay) %>%
  group_by(carrier) %>%
  summarise(avg_delay = mean(arr_delay)) %>%
  arrange(desc(avg_delay))
  • In the first, join the results to airlines in a way that keeps every row of the results, but only the matching rows of airlines.
  • In the second, select just the name and avg_delay columns in that order.

Your mission, should you choose to accept it! (cont.)

  • The answer is
flights %>%
  drop_na(arr_delay) %>%
  group_by(carrier) %>%
  summarise(avg_delay = mean(arr_delay)) %>%
  arrange(desc(avg_delay)) %>%
  left_join(airlines, by = "carrier") %>%
  select(name, avg_delay)
# A tibble: 16 × 2
   name                        avg_delay
   <chr>                           <dbl>
 1 Frontier Airlines Inc.         21.9  
 2 AirTran Airways Corporation    20.1  
 3 ExpressJet Airlines Inc.       15.8  
 4 Mesa Airlines Inc.             15.6  
 5 SkyWest Airlines Inc.          11.9  
 6 Envoy Air                      10.8  
 7 Southwest Airlines Co.          9.65 
 8 JetBlue Airways                 9.46 
 9 Endeavor Air Inc.               7.38 
10 United Air Lines Inc.           3.56 
11 US Airways Inc.                 2.13 
12 Virgin America                  1.76 
13 Delta Air Lines Inc.            1.64 
14 American Airlines Inc.          0.364
15 Hawaiian Airlines Inc.         -6.92 
16 Alaska Airlines Inc.           -9.93 

Manipulating date and time

Date in (base) R

  • Dealing with dates alone is relatively straightforward compared to dealing with date and time.

  • Dealing with date and time simultaneously is trickier

  • Let’s start with just dates first

Sys.Date() # System Date, gets the date when the command is run
[1] "2024-04-17"
  • Dates in R have class Date 📅 even though they look like character strings 🔢
class(Sys.Date())
[1] "Date"
  • It’s actually a numerical value under the hood
unclass(Sys.Date())
[1] 19830
  • So what is this number? 🤔

Date in (base) R (cont.)

  • ℹ️ 1st January 1970 is a special reference point
  • Let’s have a look at the numerical value under the hood of Date objects
unclass(as.Date("1970/01/02"))
[1] 1
unclass(as.Date("1969/12/31"))
[1] -1
  • The number under the hood is the number of days after (if positive) or before (if negative) 1st January 1970
  • You can use as.Date to convert objects to Date 📅
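Because a Date is a day count under the hood, ordinary arithmetic works on it. A small sketch with arbitrary dates:

```r
d <- as.Date("2024-04-17")

days_since_epoch <- as.numeric(d)         # days since 1970-01-01
later <- d + 30                           # adding a number moves forward by days
gap <- as.Date("2024-12-25") - d          # subtracting dates gives a time difference

# Going back from a day count to a Date.
from_number <- as.Date(19830, origin = "1970-01-01")
```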

Date in (base) R (cont.)

  • Dates do not have to be in the format of “YYYY/MM/DD”
    • In fact, there are many formats in the wild!
  • If it has a different format, then you can use the conversion specification with a “%” symbol followed by a single letter.
as.Date("Xmas is 25 December 2020",
  format = "Xmas is %d %B %Y"
)
[1] "2020-12-25"

Figure from https://xkcd.com/1179/

Date in (base) R (cont.)

  • You can find some widely used conversion specifications in the documentation at ?strptime, but some depend on your operating system
  • Below are some common ones:
    • %b abbreviated month
    • %B full month
    • %d day of the month, zero-padded (01, 02, …, 31)
    • %e day of the month, with a leading space for single digits (1, 2, …, 31)
    • %y year without century (00-99)
    • %Y year with century, e.g. 1999
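The same specifications work in the other direction with format(), which turns a Date into text. A short sketch with an arbitrary date; month names from %b and %B depend on your locale, so they are left out here.

```r
d <- as.Date("1999-03-05")

day_padded <- format(d, "%d")        # zero-padded day of the month
day_spaced <- format(d, "%e")        # day of the month padded with a space
year_short <- format(d, "%y")        # year without century
year_full  <- format(d, "%Y")        # year with century
combined   <- format(d, "%d/%m/%Y")  # specifications can be mixed with other text
```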

Date and Time in (base) R

  • Two main date-time classes in R: POSIXct and POSIXlt - try to avoid using POSIXlt if possible
  • POSIX stands for Portable Operating System Interface
  • ct stands for calendar time
as.POSIXct("2020-12-02 13:00", format = "%Y-%m-%e %H:%M")
[1] "2020-12-02 13:00:00 AEDT"
unclass(as.POSIXct("2020-12-02 13:00", format = "%Y-%m-%e %H:%M"))
[1] 1606874400
attr(,"tzone")
[1] ""
  • ℹ️ 1970/01/01 00:00:00 UTC is a special reference point called the Unix epoch, and the number above is the number of seconds after the Unix epoch
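A small sketch of the seconds-since-epoch idea. It uses explicit time zones so the result does not depend on the system setting; 13:00 in Sydney on 2 December 2020 is 02:00 UTC.

```r
epoch <- as.POSIXct("1970-01-01 00:00:00", tz = "UTC")
epoch_seconds <- as.numeric(epoch)

# The instant from the slide above, written in UTC.
t_utc <- as.POSIXct("2020-12-02 02:00:00", tz = "UTC")
t_seconds <- as.numeric(t_utc)

# Going back from seconds to a date-time needs an origin and a time zone.
back <- as.POSIXct(1606874400, origin = "1970-01-01", tz = "Australia/Sydney")
back_text <- format(back, "%Y-%m-%d %H:%M")
```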

Date and Time in (base) R (cont.)

  • POSIXlt seems like it’s the same as POSIXct
as.POSIXlt("2020-12-02 13:00", format = "%Y-%m-%e %H:%M")
[1] "2020-12-02 13:00:00 AEDT"
  • But under the hood, it’s a list of time attributes
unclass(as.POSIXlt("2020-12-02 13:00", format = "%Y-%m-%e %H:%M"))
$sec
[1] 0

$min
[1] 0

$hour
[1] 13

$mday
[1] 2

$mon
[1] 11

$year
[1] 120

$wday
[1] 3

$yday
[1] 336

$isdst
[1] 1

$zone
[1] "AEDT"

$gmtoff
[1] NA

attr(,"tzone")
[1] ""     "AEST" "AEDT"
attr(,"balanced")
[1] TRUE

Time zone

syd <- as.POSIXct("2023-04-20 13:00",
  format = "%Y-%m-%e %H:%M",
  tz = "Australia/Sydney"
)

perth <- as.POSIXct("2023-04-20 13:00",
  format = "%Y-%m-%e %H:%M",
  tz = "Australia/Perth"
) #<<

syd - perth
Time difference of -2 hours
  • You can find the names of the time zones using OlsonNames().
  • If you want to know which time zone your system is using:
Sys.timezone()
[1] "Australia/Sydney"
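
A time zone changes how an instant is displayed, not the instant itself. The sketch below, with clock times chosen for illustration, builds the same moment from a Sydney clock and from a Perth clock and checks that the two are equal.

```r
syd_1pm <- as.POSIXct("2023-04-20 13:00", tz = "Australia/Sydney")
perth_11am <- as.POSIXct("2023-04-20 11:00", tz = "Australia/Perth")

same_instant <- syd_1pm == perth_11am
same_instant # TRUE: 1 PM in Sydney is 11 AM in Perth on this date

# Display the Sydney time on a Perth clock
on_perth_clock <- format(syd_1pm, "%H:%M", tz = "Australia/Perth")
on_perth_clock # "11:00"

# Check that a time zone name is valid before using it
"Australia/Perth" %in% OlsonNames()
```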

Working with lubridate

lubridate

  • Date-time data can be frustrating to work with in base R.
  • Lubridate makes it easier to do the things R does with date-times and possible to do the things R does not.

Artwork by @allison_horst

Date in R with lubridate

  • To convert a string to a Date, you can use ymd and friends. E.g.
ymd("2012 Dec 30th")
[1] "2012-12-30"
mdy("01/30 99")
[1] "1999-01-30"
dmy("1st January 2015")
[1] "2015-01-01"

ymd and friends

  • You might have guessed it but:
    • y = year
    • m = month, and
    • d = day.
  • The order of the letters in the function name is the order in which the components are expected to appear in the string
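
The choice of function matters most when a string is ambiguous. A minimal sketch with a made-up string, plus a date that does not exist:

```r
library(lubridate)

x <- "01/02/2023"
as_dmy <- dmy(x) # read as 1 February 2023
as_mdy <- mdy(x) # read as 2 January 2023
as_dmy
as_mdy

# An impossible date is returned as NA (with a warning, silenced here)
impossible <- suppressWarnings(ymd("2023-02-30"))
impossible
```

Neither call is wrong: only knowledge of how the data were recorded tells you which one to use.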

Date and time in R with lubridate

  • To convert a string to POSIXct, you can use ymd_hms and friends
ymd_hms("20140101 201001", tz = "Australia/Sydney")
[1] "2014-01-01 20:10:01 AEDT"
mdy_h("09/09/2010 4PM")
[1] "2010-09-09 16:00:00 UTC"
ydm_hm("Today is not 2009 9th Sep 4:30PM")
[1] "2009-09-09 16:30:00 UTC"
ydm_hms("19 9 July | 4:30:03.34343")
[1] "2019-07-09 04:30:03 UTC"
ymd_hms("2023-10-01 2:00:00", tz = "Australia/Sydney") # Why NA?
[1] NA

ymd_hms and friends

  • Here, we have
    • h = hour; m = minute, and s = second.
  • It’s remarkably clever!
  • The time has to come after the date though.
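
The NA returned for "2023-10-01 2:00:00" on the previous slide is not a parsing quirk. Daylight saving started in Sydney on 1 October 2023, when clocks jumped from 2 AM straight to 3 AM, so 2:00:00 never appeared on a Sydney clock that day. A sketch of the times on either side of the gap:

```r
library(lubridate)

before <- ymd_hms("2023-10-01 01:59:59", tz = "Australia/Sydney")
after <- ymd_hms("2023-10-01 03:00:00", tz = "Australia/Sydney")

# Only one second separates the two clock times
gap_secs <- as.numeric(difftime(after, before, units = "secs"))
gap_secs

# The skipped clock time cannot be represented, so the result is NA
skipped <- suppressWarnings(
  ymd_hms("2023-10-01 02:00:00", tz = "Australia/Sydney")
)
is.na(skipped)
```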

Conversion to date and time with lubridate

  • Making Date from individual date components:
make_date(
  year = 2018,
  month = 8,
  day = 3
)
[1] "2018-08-03"
  • Making POSIXct from individual components:
make_datetime(
  year = 2018,
  month = 8,
  day = 3,
  hour = 10,
  min = 3,
  sec = 30
)
[1] "2018-08-03 10:03:30 UTC"
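
make_date and make_datetime are vectorised, which helps when the components sit in separate columns. A sketch with a small made-up data frame (the column names yr, mth and dy are hypothetical):

```r
library(lubridate)

parts <- data.frame(
  yr = c(2018, 2018, 2019),
  mth = c(8, 12, 1),
  dy = c(3, 25, 1)
)

# One Date per row, built from the three component columns
parts$date <- make_date(year = parts$yr, month = parts$mth, day = parts$dy)
parts$date
```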

Extracting date or time components with lubridate


t1 <- ymd_hms("20101010 13:30:30")
month(t1, label = TRUE)
[1] Oct
12 Levels: Jan < Feb < Mar < Apr < May < Jun < Jul < Aug < Sep < ... < Dec
year(t1)
[1] 2010
month(t1)
[1] 10
day(t1)
[1] 10
hour(t1)
[1] 13
minute(t1)
[1] 30
second(t1)
[1] 30
yday(t1)
[1] 283
mday(t1)
[1] 10
wday(t1)
[1] 1
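
wday() numbers the week from Sunday unless told otherwise, which is why 10 October 2010, a Sunday, returns 1 above. The week_start argument changes the convention. A sketch:

```r
library(lubridate)

t1 <- ymd_hms("20101010 13:30:30")

sunday_first <- wday(t1, week_start = 7) # Sunday is day 1
monday_first <- wday(t1, week_start = 1) # Monday is day 1, so Sunday is day 7
sunday_first
monday_first
```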

Date and time modifiers


month(t1) <- 3
t1
[1] "2010-03-10 13:30:30 UTC"
mday(t1) <- 20
t1
[1] "2010-03-20 13:30:30 UTC"
with_tz(t1, "Australia/Perth")
[1] "2010-03-20 21:30:30 AWST"
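
with_tz() only changes how an instant is displayed. Its companion force_tz() keeps the clock time and changes the instant instead. A sketch contrasting the two, starting from the modified value of t1 above:

```r
library(lubridate)

t1 <- ymd_hms("2010-03-20 13:30:30") # UTC by default

shown <- with_tz(t1, "Australia/Perth")   # 21:30:30 AWST, same instant
forced <- force_tz(t1, "Australia/Perth") # 13:30:30 AWST, a different instant

shown == t1
shift_hours <- as.numeric(difftime(forced, t1, units = "hours"))
shift_hours # forced is 8 hours earlier than t1
```

Use with_tz() to show a timestamp to someone in another city, and force_tz() to repair a timestamp that was recorded with the wrong time zone.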

Durations with lubridate

  • Duration is a special class in lubridate which represents an exact number of seconds under the hood.
  • Some convenient constructors for Duration are:
dyears(1)
[1] "31557600s (~1 years)"
dweeks(10)
[1] "6048000s (~10 weeks)"
ddays(4)
[1] "345600s (~4 days)"
dhours(3)
[1] "10800s (~3 hours)"

Maths with Durations with lubridate


ddays(4) + dweeks(1)
[1] "950400s (~1.57 weeks)"
ymd("2013-01-01") + ddays(5)
[1] "2013-01-06"
ymd_hms("2024-04-06 2:30:00", tz = "Australia/Sydney") + ddays(1)
[1] "2024-04-07 02:30:00 AEDT"
  • What happened below?
ymd_hms("2024-04-07 1:00:00", tz = "Australia/Sydney") + dhours(3)
[1] "2024-04-07 03:00:00 AEST"
  • Daylight saving ended at 3 AM on Sunday 7 April 2024 in Sydney, when clocks went back to 2 AM, so three elapsed hours only move the clock from 1 AM to 3 AM
  • Daylight saving is tricky to handle, as one day each year has 23 hours and another has 25!
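
The 23-hour and 25-hour days can be checked directly by measuring from midnight to midnight. The sketch below does this for Sydney in 2024, where daylight saving ended on 7 April and started again on 6 October. The helper hours_in_day() is hypothetical, written only for this illustration.

```r
library(lubridate)

# Hypothetical helper: elapsed hours between two local midnights
hours_in_day <- function(day, next_day, tz = "Australia/Sydney") {
  start <- ymd(day, tz = tz)
  end <- ymd(next_day, tz = tz)
  as.numeric(difftime(end, start, units = "hours"))
}

april <- hours_in_day("2024-04-07", "2024-04-08")   # clocks go back
october <- hours_in_day("2024-10-06", "2024-10-07") # clocks go forward
june <- hours_in_day("2024-06-01", "2024-06-02")    # an ordinary day
c(april, october, june)
```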

Period with lubridate

  • Period is another special class in lubridate. It represents human units like weeks and months, which do not have a fixed length in seconds.
    • It must have integer values in its argument.
  • Constructors for Period are like for Duration but without the prefix “d”:
years(1)
[1] "1y 0m 0d 0H 0M 0S"
weeks(10)
[1] "70d 0H 0M 0S"
days(4)
[1] "4d 0H 0M 0S"
hours(3)
[1] "3H 0M 0S"

Period with lubridate (cont.)


days(4) + weeks(1)
[1] "11d 0H 0M 0S"
ymd("2013-01-01") + days(5)
[1] "2013-01-06"
ymd_hms("2023-04-03 2:00:00", tz = "Australia/Sydney") + days(1)
[1] "2023-04-04 02:00:00 AEST"
ymd_hms("2023-04-02 1:00:00", tz = "Australia/Sydney") + hours(1)
[1] "2023-04-02 02:00:00 AEDT"
ymd_hms("2023-04-02 1:00:00", tz = "Australia/Sydney") + hours(2)
[1] "2023-04-02 03:00:00 AEST"

How do you pick between durations and periods?

  • As always, pick the simplest data structure that solves your problem.
  • If you only care about physical time, use a duration.
  • If you need to add human times, use a period.
ymd("2023-03-01") + years(1) # period
[1] "2024-03-01"
ymd("2023-03-01") + dyears(1) # duration
[1] "2024-02-29 06:00:00 UTC"
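
The same contrast shows up across a daylight saving change. A sketch for Sydney, where clocks went back one hour on 7 April 2024: adding a period of one day keeps the clock time, while adding a duration of one day adds exactly 24 hours.

```r
library(lubridate)

noon <- ymd_hms("2024-04-06 12:00:00", tz = "Australia/Sydney")

plus_period <- noon + days(1)    # same clock time on the next day
plus_duration <- noon + ddays(1) # exactly 24 hours later
plus_period
plus_duration
```

The period lands on 12:00 the next day and the duration on 11:00, because that day was 25 hours long.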

Holiday and Seasons

  • Holidays and seasons are commonly used as features in an analysis.
  • Here are some packages and functions that may be useful in this area.
tis::holidays(2024) # US holiday
    NewYears       MLKing   GWBirthday     Memorial   Juneteenth Independence 
    20240101     20240115     20240219     20240527     20240619     20240704 
       Labor     Columbus     Veterans Thanksgiving    Christmas 
    20240902     20241014     20241111     20241128     20241225 
tsibble::holiday_aus(2024) # Aus Holiday
# A tibble: 7 × 2
  holiday        date      
  <chr>          <date>    
1 New Year's Day 2024-01-01
2 Australia Day  2024-01-26
3 Good Friday    2024-03-29
4 Easter Monday  2024-04-01
5 ANZAC Day      2024-04-25
6 Christmas Day  2024-12-25
7 Boxing Day     2024-12-26
lubridate::quarter(ymd(tis::holidays(2024)))
 [1] 1 1 1 2 2 3 3 4 4 4 4
hydroTSM::time2season(ymd(tis::holidays(2024)), out.fmt = "seasons")
 [1] "winter" "winter" "winter" "spring" "summer" "summer" "autumn" "autumn"
 [9] "autumn" "autumn" "winter"
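
Once holidays are available as a Date vector, flagging them in your own data is a matter of %in%. The sketch below copies three dates from the holiday_aus output above and checks a made-up vector of dates against them. It also includes a hypothetical helper, season_south(), for southern-hemisphere meteorological seasons, since the time2season output above labels January as winter.

```r
# Three holiday dates copied from the output above
aus_holidays <- as.Date(c("2024-01-01", "2024-01-26", "2024-04-25"))

# Made-up dates to classify
dates <- as.Date(c("2024-01-26", "2024-01-27", "2024-04-25", "2024-07-15"))

is_holiday <- dates %in% aus_holidays
is_holiday

# Hypothetical helper: month number to southern-hemisphere season
# (Dec-Feb summer, Mar-May autumn, Jun-Aug winter, Sep-Nov spring)
season_south <- function(x) {
  m <- as.integer(format(x, "%m"))
  labels <- c("summer", "autumn", "winter", "spring")
  labels[(m %% 12) %/% 3 + 1]
}

seasons <- season_south(dates)
seasons
```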

Your mission, should you choose to accept it!

  • You will be working with dates and times in these exercises.
    • Note that answers can be achieved in multiple ways.
  • The oz_climate data (oz_climate.csv) contains results from a survey about attitudes towards climate change in Australia.
# A tibble: 1,927 × 200
  RespondentID StartDate EndDate q1_1  q1_2  q1_3  q1_4  q1_5  q1_6  q1_7  q2_1 
  <chr>        <chr>     <chr>   <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 1502636269   08/04/20… 08/04/… Stro… Stro… Unsu… Mild… Unsu… Mild… Unsu… Unsu…
2 1502666184   08/04/20… 08/04/… Mild… Unsu… Mild… Mild… Mild… Mild… Stro… Unsu…
3 1502686727   08/04/20… 08/04/… Stro… Mild… Stro… Mild… Mild… Stro… Stro… Mild…
4 1502731096   08/04/20… 08/04/… Stro… Stro… Mild… Stro… Stro… Mild… Mild… Mild…
5 1502742259   08/04/20… 08/04/… Stro… Mild… Mild… Unsu… Stro… Mild… Stro… Stro…
# ℹ 1,922 more rows
# ℹ 189 more variables: q2_2 <chr>, q2_3 <chr>, q2_4 <chr>, q2_5 <chr>,
#   q2_6 <chr>, q2_7 <chr>, q2_8 <chr>, q3_1 <chr>, q3_2 <chr>, q3_3 <chr>,
#   q3_4 <chr>, q3_5 <chr>, q3_6 <chr>, q3_7 <chr>, q4 <chr>, q5 <chr>,
#   q6 <chr>, q7 <chr>, q7_other <chr>, q8 <chr>, q9 <chr>, q10 <chr>,
#   q11 <chr>, q12 <chr>, q13 <chr>, q14_1 <chr>, q14_2 <chr>, q15 <chr>,
#   q16_1 <chr>, q16_2 <chr>, q16_3 <chr>, q16_4 <chr>, q16_5 <chr>, …

Your mission, should you choose to accept it! (cont.)

  • The oz_climate_qbook data (oz_climate_qbook.csv) maps each column label in oz_climate to the actual question asked
oz_climate_qbook %>%
  print(n = 5)
# A tibble: 200 × 2
  code         desc                                                             
  <chr>        <chr>                                                            
1 RespondentID RespondentID                                                     
2 StartDate    StartDate                                                        
3 EndDate      EndDate                                                          
4 q1_1         We are approaching the limit of the number of people the earth c…
5 q1_2         Humans have the right to modify the natural environment to suit …
# ℹ 195 more rows

Objective A: Computing time difference

Compute the five number summary for the time taken to complete the survey for oz_climate by filling in … below.

oz_climate %>%
  select(StartDate, EndDate) %>%
  ...() %>%
  pull(Time) %>%
  fivenum() # or as.numeric() %>% summary()
oz_climate %>%
  select(StartDate, EndDate) %>%
  mutate(
    StartDate = mdy_hms(StartDate),
    EndDate = mdy_hms(EndDate),
    Time = EndDate - StartDate
  ) %>%
  pull(Time) %>%
  fivenum()
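
One thing to watch in the solution above: subtracting two date-times lets R choose the units (seconds, minutes, hours and so on) from the data, so the numbers in a summary can be hard to interpret. A sketch with two made-up start and end times, fixing the units to minutes with difftime():

```r
start <- as.POSIXct(c("2011-08-04 10:00:00", "2011-08-04 11:15:00"), tz = "UTC")
end <- as.POSIXct(c("2011-08-04 10:12:30", "2011-08-04 13:15:00"), tz = "UTC")

end - start # units are picked automatically

# Explicit units give plain numbers that are safe to summarise
mins <- as.numeric(difftime(end, start, units = "mins"))
mins
```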

Objective B: Filter by date

Filter oz_climate to surveys that were completed on or after August 13th 2011.

oz_climate %>%
  mutate(...) %>%
  filter(...) %>%
  select(StartDate, EndDate)
oz_climate %>%
  mutate(EndDate = mdy_hms(EndDate)) %>%
  filter(EndDate >= ymd("20110813")) %>%
  select(StartDate, EndDate)

Objective C: Convert string to date, datetime or period

Convert each string below to appropriate date, time or datetime objects.

...("May 2nd, 2014")
...("Jan 30th (2019)")
...("2020-12-29 08:30:27.243")
...("Jun 28 2018 8:40AM")
...("8:40") # 8 hours and 40 minutes
mdy("May 2nd, 2014")
mdy("Jan 30th (2019)")
ymd_hms("2020-12-29 08:30:27.243")
mdy_hm("Jun 28 2018 8:40AM")
hm("8:40")
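
Note that hm() returns a Period rather than a clock time. If a plain number is needed, for example to average many such values, the period can be converted to seconds. A sketch:

```r
library(lubridate)

p <- hm("8:40")

secs <- period_to_seconds(p) # 8 * 3600 + 40 * 60
secs
secs / 3600 # the same length expressed in hours
```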

Objective D: When was it?

Caution

  • What day of the week was Christmas in 1990? (Return the day abbreviated)
  • What day of the year is it today?
  • What hour is it now in Singapore? Return as integer in 24-hour time.
wday(dmy("25/12/90"), label = TRUE) # the two-digit year "90" is read as 1990
yday(today())
hour(now(tzone = "Singapore"))

Skills

  • Introduction to R and RStudio
  • Working with data in R, particularly data.frame and tibble
  • Data wrangling with base R and dplyr
  • Join datasets
  • Manipulating date and time